[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script#36949
AndreasKaratzas wants to merge 26 commits into vllm-project:main
Conversation
…line Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces significant optimizations to the ROCm Docker build process by leveraging docker bake, multi-stage builds, and caching mechanisms like ccache. The new ci-bake.sh script centralizes and improves the CI build logic, enhancing build times and reliability. The changes are well-structured and thoughtful. I've identified a couple of critical issues related to missing runtime dependencies in the Dockerfile and a high-severity issue regarding configuration consistency in the new bake script.
docker/Dockerfile.rocm
Outdated
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
The rocshmem library appears to be a runtime dependency for deepep. This test stage installs the deepep wheel but no longer copies the rocshmem installation from the build stage. This could lead to runtime errors if the deepep wheel does not bundle the rocshmem shared libraries. Please restore the copy of the rocshmem directory from the build_rocshmem stage to ensure deepep can function correctly.
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
uv pip install --system /deep_install/*.whl
# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Similar to the test stage, the final stage now installs the deepep wheel but omits the rocshmem runtime libraries, which are likely a runtime dependency; this could cause failures at runtime. Please add a COPY instruction to include the rocshmem installation from the build_rocshmem stage.
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
uv pip install --system /deep_install/*.whl
# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem
.buildkite/scripts/ci-bake.sh
Outdated
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
    echo "Using existing baked-vllm-builder"
    docker buildx use baked-vllm-builder
else
    echo "Creating baked-vllm-builder with remote driver"
    docker buildx create \
        --name baked-vllm-builder \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
There's an inconsistency in the buildx builder naming. The script accepts a BUILDER_NAME environment variable (defaulting to vllm-builder), but when a local buildkitd socket is detected, it hardcodes the builder name to baked-vllm-builder. This could lead to confusion and incorrect builder usage if BUILDER_NAME is customized. For consistency, please use the ${BUILDER_NAME} variable throughout the script.
Suggested change (use `${BUILDER_NAME}` throughout):

# Check if ${BUILDER_NAME} already exists and is using the socket
if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
    echo "Using existing builder: ${BUILDER_NAME}"
    docker buildx use "${BUILDER_NAME}"
else
    echo "Creating builder '${BUILDER_NAME}' with remote driver"
    docker buildx create \
        --name "${BUILDER_NAME}" \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
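For reference, a minimal standalone sketch (hypothetical script, not the actual `ci-bake.sh`) of resolving the builder name once from the environment, with the script's documented default, so no hardcoded name can drift:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive the builder name once and reuse it for every
# buildx invocation, instead of hardcoding "baked-vllm-builder".
set -euo pipefail

BUILDER_NAME="${BUILDER_NAME:-vllm-builder}"  # default named in the review

ensure_builder() {
  # Create-or-reuse the named builder (needs a docker daemon; not run here).
  if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
    docker buildx use "${BUILDER_NAME}"
  else
    docker buildx create --name "${BUILDER_NAME}" --driver remote --use \
      "unix://${BUILDKIT_SOCKET:-/run/buildkit/buildkitd.sock}"
  fi
}

echo "builder: ${BUILDER_NAME}"
```

The point of the pattern is that every `docker buildx` call goes through the same variable, so customizing `BUILDER_NAME` in CI affects all code paths at once.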
docker/Dockerfile.rocm
Outdated
-    apt-transport-https ca-certificates wget curl
+    apt-transport-https ca-certificates wget curl \
+    ccache mold \
+    && update-alternatives --install /usr/bin/ld ld /usr/bin/mold 100
Changing the system linker hardly falls under "Install some basic utilities".
Could you at least provide the motivation for this in the PR description?
Correct, my bad; I updated the comment there as well.
docker/Dockerfile.rocm
Outdated
RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    cd vllm \
    && uv pip install --system -r requirements/rocm-build.txt \
Why is rocm-build.txt being used in the docker build?
That's an oversight on my part; I thought it was just rocm.txt. I updated that as well.
docker/Dockerfile.rocm
Outdated
COPY requirements/rocm-build.txt requirements/rocm-build.txt
COPY pyproject.toml setup.py CMakeLists.txt ./
COPY cmake cmake/
COPY csrc csrc/
Are you copying host files here? The point of REMOTE_VLLM is exactly to not do this
I refactored based on offline conversation to bring back the old way of doing this and avoid any trouble. I also integrated another recommended point which is a per-arch build so that we then use a specific docker dependency and not an all-arch dependency. Hope it looks better now :)
First successful build with vllm-project/ci-infra#307: https://buildkite.com/vllm/amd-ci/builds/6536/steps/canvas
…e condition in highly concurrent max job settings Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This pull request has merge conflicts that must be resolved before it can be merged.
@mawong-amd Let's check if …
Summary
Implements the three-tier Docker build described in the RFC (#34994) for ROCm CI. Every PR currently rebuilds RIXL, DeepEP, rocshmem, torchcodec, and RDMA libraries from scratch — costing ~11–19 minutes of pure overhead per build. This PR introduces a pre-built Tier-1 `ci_base` image that absorbs those stable layers. Per-PR builds then only rebuild the thin vLLM wheel + workspace layer (~2–3 minutes).

Image registry layout after this PR:

| Image | Built from |
| --- | --- |
| `rocm/vllm-dev:base` | `Dockerfile.rocm_base` |
| `rocm/vllm-dev:ci_base` | `ci_base` stage (this PR) |
| `rocm/vllm-ci:$COMMIT` | `test` stage (this PR) |

**`docker/Dockerfile.rocm` — global `ARG CI_BASE_IMAGE`**

    +ARG CI_BASE_IMAGE=rocm/vllm-dev:ci_base

Docker requires any `ARG` used in a `FROM` instruction to be declared before the first `FROM` in the file. `CI_BASE_IMAGE` controls which image the `test` stage inherits from. The default `rocm/vllm-dev:ci_base` points to the stable weekly-built Tier-1 image on Docker Hub. When building `--target ci_base` (the weekly scheduled build), this ARG is irrelevant because the `ci_base` stage inherits from `base`, not `CI_BASE_IMAGE`.

**`docker/Dockerfile.rocm` — new `ci_base` stage**

    +FROM base AS ci_base

A new intermediate Docker build stage inserted between the existing `export_vllm_wheel_release` and `test` stages. Everything in this stage is stable — it changes only when pinned dependency branches (e.g. `RIXL_BRANCH`, `DEEPEP_BRANCH`) or the upstream `base` image change. Building this stage takes ~10–18 minutes; by caching it as `rocm/vllm-dev:ci_base`, every per-PR `test` build avoids repeating that work.

What goes into `ci_base` and why (the stable parts of the `test` stage):

- RIXL (from `build_rixl`), pinned by `RIXL_BRANCH`
- DeepEP (from `build_deepep`), pinned by `DEEPEP_BRANCH`
- rocshmem (from `build_rocshmem`), pinned by `ROCSHMEM_BRANCH`
- RDMA runtime packages (`librdmacm1`, `libibverbs1`, …)
- `hf_transfer` + `pytest-shard`
- `HF_HUB_ENABLE_HF_TRANSFER=1`
- `MIOPEN_DEBUG_CONV_DIRECT/GEMM=0`

**`docker/Dockerfile.rocm` — `test` stage now inherits from `${CI_BASE_IMAGE}`**

The `test` stage now starts from `rocm/vllm-dev:ci_base` (pulled from registry) instead of `base`. Docker pulls the pre-built Tier-1 image and adds only the PR-specific layers on top:

- requirements (`rocm.txt`, `rocm-test.txt`) + the PR wheel
- `/opt/vllm-wheels/` for `python_only_compile_rocm.sh`
- the workspace (`/vllm-workspace`) + `vllm_test_utils` install
- the `vllm/v1` package + `src/vllm` layout

All of these vary every PR, so they can't be pre-baked. The net result is that the per-PR build touches only ~2–3 minutes of work instead of 15+.
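As a rough illustration of the wiring described above (a sketch only; stage ordering and the stable/per-PR layer contents are paraphrased from this summary, not copied from the actual `Dockerfile.rocm`):

```dockerfile
# Sketch only (not the real Dockerfile.rocm): shows why the ARG must precede
# the first FROM, and how the test stage re-bases onto CI_BASE_IMAGE.
ARG CI_BASE_IMAGE=rocm/vllm-dev:ci_base

FROM rocm/vllm-dev:base AS base

FROM base AS ci_base
# ...stable layers: RIXL/DeepEP/rocshmem wheels, RDMA packages, env defaults...

# Per-PR builds resolve this to the registry image; the weekly build can pass
# CI_BASE_IMAGE=ci_base so it resolves to the local stage above instead.
FROM ${CI_BASE_IMAGE} AS test
# ...thin per-PR layers: requirements, the vLLM wheel, /vllm-workspace...
```

Because `FROM` can name either a registry image or an earlier stage, the same `ARG` serves both the per-PR path and the weekly Tier-1 build.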
When `CI_BASE_IMAGE` is set to the local build-stage name `ci_base` (e.g. in a `docker buildx bake ci-base-rocm-ci` run), Docker resolves it as the locally-built stage rather than pulling from the registry, so the weekly Tier-1 build itself is not broken by this change.

**`.buildkite/hardware_tests/amd.yaml` — docker pull + new env vars**

Applied to all four build steps (`all-archs`, `gfx90a`, `gfx942`, `gfx950`).

    docker pull rocm/vllm-dev:ci_base

Pre-fetches the Tier-1 image before `docker buildx bake` runs. Without this, BuildKit would pull it lazily during the build and the pull would not appear in plain-progress output, making timing harder to diagnose. Also ensures the pull fails fast and visibly if the weekly image is missing.

    CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"

Passed to `docker buildx bake` as a Docker `--build-arg` (via the `_ci-rocm` target in `ci-rocm.hcl` in ci-infra). Tells `Dockerfile.rocm` which image to use as the base of the `test` stage. Without this, `docker buildx bake` would use the HCL variable default, which may not match the registry tag.

    REMOTE_VLLM: "1"

`Dockerfile.rocm` has two code paths for getting the vLLM source:

- `REMOTE_VLLM=0` (default, local dev): `COPY` from the local build context.
- `REMOTE_VLLM=1` (CI path): `git clone $VLLM_REPO --branch $VLLM_BRANCH` inside the build.

In CI the build agent runs `docker buildx bake` from the vllm repo checkout, but the vLLM source to be tested is the PR commit. Setting `REMOTE_VLLM=1` tells the Dockerfile to clone from GitHub at build time using `VLLM_BRANCH` below.

    VLLM_BRANCH: "${BUILDKITE_COMMIT}"

The git ref that Docker clones when `REMOTE_VLLM=1`. Set to the exact commit SHA being tested (Buildkite expands `${BUILDKITE_COMMIT}` at pipeline-step evaluation time). Without this, the build would check out `main` regardless of which PR commit triggered the build — producing a test image that does not match the PR.

**`.buildkite/hardware_tests/amd-ci-base.yaml` — new scheduled pipeline**
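Taken together, one build step's environment might look roughly like this (a hedged sketch of the shape only; the surrounding step fields are illustrative, not copied from `amd.yaml`):

```yaml
# Sketch of a single build step in .buildkite/hardware_tests/amd.yaml
# (step structure is illustrative; only the three env vars come from the PR).
  - label: "docker build (gfx942)"
    command: docker pull rocm/vllm-dev:ci_base && docker buildx bake ...
    env:
      CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"   # base of the test stage
      REMOTE_VLLM: "1"                         # clone vLLM in-build (CI path)
      VLLM_BRANCH: "${BUILDKITE_COMMIT}"       # exact PR commit SHA
```

The key property is that `VLLM_BRANCH` is expanded by Buildkite before the build starts, so the cloned source always matches the commit under test.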
New pipeline file that runs weekly (or on-demand when dependency branches change) to build and push `rocm/vllm-dev:ci_base`.

    DATED_TAG / CI_BASE_IMAGE_TAG_DATED

The dated snapshot tag (e.g. `rocm/vllm-dev:ci_base-20250330`). Computed at runtime so it always reflects today's date. Used for rollback: if the weekly build introduces a regression, you can pin `CI_BASE_IMAGE` to a specific dated tag in `amd.yaml` without rebuilding.

    IMAGE_TAG (set to the dated tag)

`ci-bake.sh` checks whether `$IMAGE_TAG` already exists in the registry before building. Setting it to the dated tag means the weekly build always runs on a new day (the dated tag does not exist yet), and re-running on the same day is idempotent (skipped).

    CI_BASE_IMAGE_TAG

The stable `rocm/vllm-dev:ci_base` tag that per-PR builds pull. Always pushed by this pipeline, so it always points to the most recent weekly build.

    DOCKERHUB_CACHE_TO: "rocm/vllm-ci-cache:rocm-latest"

Tells BuildKit to write the layer cache to Docker Hub after the weekly build. This seeds the `:rocm-latest` cache tag so that per-PR builds (which read from this cache) get warm layers from a recent main-branch state.

This PR is connected to: vllm-project/ci-infra#307
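The dated-tag mechanics above can be sketched in a few lines of shell (variable names follow this summary; the registry-existence check is shown commented out, since it needs registry access):

```shell
# Compute today's dated rollback tag, as described for DATED_TAG above.
DATE_STAMP="$(date +%Y%m%d)"                       # e.g. 20250330
DATED_TAG="rocm/vllm-dev:ci_base-${DATE_STAMP}"
IMAGE_TAG="${DATED_TAG}"

# Idempotency sketch: skip the build if the tag already exists in the
# registry. (Needs credentials/network, so it is illustrated, not executed.)
# if docker manifest inspect "${IMAGE_TAG}" >/dev/null 2>&1; then
#   echo "Image ${IMAGE_TAG} already exists; skipping build."
# fi

echo "${DATED_TAG}"
```

Because the tag embeds the date, the first run of a given day always builds, and any rerun that day sees the tag already published and skips.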
These two PRs should likely be merged simultaneously.
cc @kenroche @okakarpa @tjtanaa @gshtras @khluu
Co-authored-by: Claude claude@anthropic.com